Lecture Week 05: linear models

1 Overview

2 Setup

2.0.1 Libraries

You may need to install the arm package with the following:

install.packages("arm")

Remember to run this only once, in your console, and NOT to put it into your script.

library(arm)
library(tidyverse)

2.0.2 Data

For this lab, we will be using data from an experiment that Charles Darwin performed on the effects of self-pollination in Zea mays (maize) plants.

  • Darwin was interested in determining whether high genetic diversity also leads to healthier plants; self-pollinated plants are assumed to have lower genetic diversity

  • Briefly, Darwin was interested in whether or not there was a difference in fitness of seeds which were produced from self- or cross-pollinated plants.

  • Here, the height of the plants was used as a surrogate for fitness.

  • We will be using the darwin_maize.csv data, which you can download from D2L.

  • Be sure to put it into your data/ folder.

  • When we read in the data, we will assign it to the object called darwin_maize.

darwin_maize <- read.csv("data/darwin_maize.csv",
                stringsAsFactors = TRUE)

3 Outline

  1. What is a model?

  2. What is a linear model?

  3. Basic linear models (problems 1-3, 5)

  4. Confidence intervals for model coefficients (problem 4)

  5. Linear Model Assumptions

4 Lecture

4.0.1 What is a model?

“An informative representation of an object, person, or system.”

  • Many types of models

    • Conceptual

    • Graphical

    • Mathematical

  • We will be working with statistical models

    • Mathematical representation of our hypotheses
  • By necessity, models will be simplifications of reality

“All models are wrong, but some are useful” (George Box)

  • Inference requires models

  • Models link observations to processes

4.0.2 What is a linear model?

  • What is the equation of a line?

The answer is probably something similar to:

\[\Huge y = a + bx\]

x <- seq(from = -10, to = 10, by = 0.1)
line_df <- data.frame(x = x,
                      y = -5 + 2 * x)
ggplot(line_df, aes(x = x, y = y)) + 
  geom_path(color = "#446E9B", linewidth = 2) +
  theme_bw()

  • In reality, not every point is exactly on the line, right?
  • Add stochasticity:

\[\Huge y_i = a + bx_i + \epsilon_i\]

  • \(\epsilon\) is “epsilon”, and often represents “error” or “noise”
set.seed(23)
obs_df <- line_df |>
  sample_n(50) |>
  rowwise() |> 
  mutate(obs = y + rnorm(1, 0, 2)) # add the error to y, not x
  
ggplot(line_df, aes(x = x, y = y)) + 
  geom_path(color = "black", linewidth = 2) +
  geom_point(data = obs_df,
             aes(x = x, y = obs),
             size = 2,
             color = "dodgerblue") +
  theme_bw()

4.0.3 Basic linear models (problems 1-3, 5)

  • First, let’s modify our variables in the linear model

  • our intercept was \(a\) but we are going to change it to \(\beta_0\)

    • “Beta zero”
  • the \(b\) will be called \(\beta_1\)

    • “beta one”

    • This is much more flexible, as we can have \(\beta_2\), \(\beta_3\), …, \(\beta_n\)

  • Linear models have the following components:

\[\Large response = deterministic\; part+stochastic\; part\]

  • Write the model showing how our expected value \(E[y]\) is a function of the predictor:

\[\underbrace{\LARGE E[y] = \beta_0 + \beta_1 \times x_1}_{Deterministic}\]

  • With the addition of a stochastic error:

\[\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma)}_{Stochastic}\]

  • Add all together:

\[\LARGE y_i = \beta_0 + \beta_1 \times x_1 + \epsilon_i\]

  • note that in these examples, the \(i\) subscript corresponds to individual observations

  • The \(1\) subscript refers to the predictor

    • This could be different group identities within a category

    • Or we could have multiple \(x\)’s (we will come back to this later)
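To see how the deterministic and stochastic parts combine, here is a sketch (with made-up values \(\beta_0 = 2\), \(\beta_1 = 0.5\), \(\sigma = 1\)) simulating data from this model and letting R's `lm()` function, which we will meet properly below, recover the coefficients:

```r
# simulate data from y_i = beta0 + beta1 * x_1 + epsilon_i
set.seed(42)
beta0 <- 2    # intercept (made-up value)
beta1 <- 0.5  # slope (made-up value)
sigma <- 1    # sd of the stochastic part

x1 <- runif(100, 0, 10)
y  <- beta0 + beta1 * x1 + rnorm(100, mean = 0, sd = sigma)

# fit the linear model and inspect the estimated coefficients
sim_lm <- lm(y ~ 1 + x1)
coef(sim_lm)  # estimates should be close to 2 and 0.5
```

Because the simulated noise is small relative to the signal, the fitted coefficients land close to the values we put in.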

4.0.3.1 Global Intercept model

  • If there are no groups and no continuous predictors, we can fit a model which is calculating the global intercept

  • Here, \(x_1 = 0\), so \(\beta_1 \times x_1 = 0\) and the model simplifies

  • In other words, this is calculating the average value of our response

\[\LARGE E[y] = \beta_0\]

  • Let’s look at the darwin_maize data
names(darwin_maize)
[1] "pot"    "pair"   "type"   "height"
head(darwin_maize)
  pot pair  type height
1   I    1 Cross 23.500
2   I    1  Self 17.375
3   I    2 Cross 12.000
4   I    2  Self 20.375
5   I    3 Cross 21.000
6   I    3  Self 20.000
  • We are interested in the height column as our response.

    • Later we will look at the type for our groups
  • Let’s plot the response variable, height, in the darwin_maize data

ggplot(darwin_maize,
       aes(y = height,
           x = 0)) +
  geom_point() +
  theme_bw()

  • We can also just plot the mean of this variable
ggplot(darwin_maize,
       aes(x = 0, 
           y = height)) +
  stat_summary(fun = mean, 
               geom = "point",
               color = "black",
               shape = 15, 
               size = 6) +
  theme_bw()

  • Now let’s fit our first linear model using the lm() function

    • When fitting a model, you will use the syntax of:

    • response_column ~ predictor_column

    • In this case there is no predictor column, so we just use the number 1

dlm0 <- lm(formula = height ~ 1,
           data = darwin_maize)
  • dlm0 is a new object type

    • This is a list

    • Each element in the list contains different information about the linear model

mode(dlm0)
[1] "list"
names(dlm0)
 [1] "coefficients"  "residuals"     "effects"       "rank"         
 [5] "fitted.values" "assign"        "qr"            "df.residual"  
 [9] "call"          "terms"         "model"        
str(dlm0)
List of 11
 $ coefficients : Named num 18.9
  ..- attr(*, "names")= chr "(Intercept)"
 $ residuals    : Named num [1:30] 4.62 -1.51 -6.88 1.49 2.12 ...
  ..- attr(*, "names")= chr [1:30] "1" "2" "3" "4" ...
 $ effects      : Named num [1:30] -103.428 -2.221 -7.596 0.779 1.404 ...
  ..- attr(*, "names")= chr [1:30] "(Intercept)" "" "" "" ...
 $ rank         : int 1
 $ fitted.values: Named num [1:30] 18.9 18.9 18.9 18.9 18.9 ...
  ..- attr(*, "names")= chr [1:30] "1" "2" "3" "4" ...
 $ assign       : int 0
 $ qr           :List of 5
  ..$ qr   : num [1:30, 1] -5.477 0.183 0.183 0.183 0.183 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:30] "1" "2" "3" "4" ...
  .. .. ..$ : chr "(Intercept)"
  .. ..- attr(*, "assign")= int 0
  ..$ qraux: num 1.18
  ..$ pivot: int 1
  ..$ tol  : num 1e-07
  ..$ rank : int 1
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 29
 $ call         : language lm(formula = height ~ 1, data = darwin_maize)
 $ terms        :Classes 'terms', 'formula'  language height ~ 1
  .. ..- attr(*, "variables")= language list(height)
  .. ..- attr(*, "factors")= int(0) 
  .. ..- attr(*, "term.labels")= chr(0) 
  .. ..- attr(*, "order")= int(0) 
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(height)
  .. ..- attr(*, "dataClasses")= Named chr "numeric"
  .. .. ..- attr(*, "names")= chr "height"
 $ model        :'data.frame':  30 obs. of  1 variable:
  ..$ height: num [1:30] 23.5 17.4 12 20.4 21 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language height ~ 1
  .. .. ..- attr(*, "variables")= language list(height)
  .. .. ..- attr(*, "factors")= int(0) 
  .. .. ..- attr(*, "term.labels")= chr(0) 
  .. .. ..- attr(*, "order")= int(0) 
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(height)
  .. .. ..- attr(*, "dataClasses")= Named chr "numeric"
  .. .. .. ..- attr(*, "names")= chr "height"
 - attr(*, "class")= chr "lm"
  • We can use functions to help us display the important parts of the model

  • Today we will use display() from the arm package.

  • Later, we will use summary()

    • NOT to be confused with summarize()
display(dlm0)
lm(formula = height ~ 1, data = darwin_maize)
            coef.est coef.se
(Intercept) 18.88     0.58  
---
n = 30, k = 1
residual sd = 3.18, R-Squared = 0.00
  • This output is called a table of coefficients.

  • There is a row for each coefficient

    • in this case, just the intercept
  • the columns show the estimate as well as the standard error of the estimate.

  • Coefficient tables in R always list the first coefficient as the intercept

  • It can sometimes be tricky to figure out what the intercept is actually referring to.

  • In this case, we can confirm it’s the global average by using the mean() function, either on its own or with dplyr:

mean(darwin_maize$height)
[1] 18.88333
darwin_maize %>%
  summarize(global_avg = mean(height))
  global_avg
1   18.88333
  • The number of observations is displayed in the table of coefficients as n.

  • The number of parameters estimated is shown with k.

  • It also shows us the \(R^2\) value,

    • which tells us how much variation is explained by the predictor variables used in our model.

    • \(R^2\) values range from 0 (no explanation) to 1 (perfect explanation).

    • In this case we don’t have any predictor variables, so the intercept-only model explains 0% of the variation in the response variable.

The table also shows us the residual standard deviation, which in this case is the same as using sd(), since the model ignores any group structure.

sd(darwin_maize$height)
[1] 3.180953
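For reference, the \(R^2\) in the table can be computed by hand as 1 minus the ratio of residual to total sums of squares. A sketch with simulated stand-in data (the data have a real group effect, so \(R^2\) is nonzero, unlike our intercept-only model):

```r
# simulated data with a real group effect
set.seed(5)
y <- c(rnorm(15, 20, 3), rnorm(15, 17, 3))
g <- rep(c("A", "B"), each = 15)
fit <- lm(y ~ g)

rss <- sum(residuals(fit)^2)  # residual sum of squares
tss <- sum((y - mean(y))^2)   # total sum of squares
1 - rss / tss                 # matches summary(fit)$r.squared
```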

4.0.4 Returning back to the model

  • This model:

\[\LARGE E[y] = \beta_0\]

  • Only shows us the expected value of \(y\)

    • In other words, just the average
  • But our data has lots of observations

  • In order to account for that variability, we modify our model to be this:

\[\LARGE y_i = \beta_0 + \epsilon_i\]

  • Now, our individual observations ( \(y_i\) ) can be explained by our intercept \(\beta_0\) plus some individual variation, \(\epsilon_i\)

  • So, we can now write out our distribution for \(\epsilon\)

\[\LARGE \epsilon \sim N(0, \sigma)\]

  • So, if we wanted to simulate new data observations with rnorm(), what values would we put in?

    • i.e., mean = ?, sd = ?
    rnorm(10, mean = 18.88, sd = 3.18)
 [1] 12.36042 17.12835 17.03791 20.55950 18.43558 23.79608 22.61248 20.91501
 [9] 18.72571 17.49878
  • Finally, we can plot our full data with our estimated mean to illustrate this
ggplot(darwin_maize,
       aes(y = height,
           x = 0)) +
  geom_point(color = "dodgerblue") +
  stat_summary(fun = mean, 
               geom = "point",
               color = "black",
               shape = 15, 
               size = 6) +
  theme_bw()

4.0.4.1 Linear Model with Categorical Predictor

  • Now let’s add a categorical predictor

    • This is a predictor which classifies our observations into 2 or more groups

    • For the darwin_maize, this is indicated in the type column and has groups of self and cross

  • Let’s go back to our original model

\[\LARGE E[y_j] = \beta_0 + \beta_1 \times x_1\]

  • Here, \(E[y_j]\) is the expected value for each group, where \(y\) is our response and \(j\) changes with each group

  • To calculate the expected value of each group, we need to introduce “dummy variables”

    • for our reference group, \(x_1 = 0\)

    • For our second group, \(x_1 = 1\)

    • If we had more groups, we would have more \(x_j\)’s and they would either be a 0 or 1.

      • We will come back to this when we discuss ANOVA
  • Since \(x_1 = 0\) for our reference group, the expected value is:

\[\LARGE E[y_{reference}] = \beta_0 + \beta_1 \times 0 = \beta_0\]

Important
  • In nearly ALL our models, the intercept ( \(\beta_0\) ) is going to represent the average value of our reference group

  • Figuring out what our reference group is can be tricky

  • Since \(x_1 = 1\) for our other group, the expected value is:

\[\LARGE E[y_{other}] = \beta_0 + \beta_1 \times 1 = \beta_0 + \beta_1\]

Important
  • So to figure out the average value of our other group, we need to add \(\beta_0\) and \(\beta_1\)
  • When there are multiple groups, you always start with \(\beta_0\) and add \(\beta_j\)
  • You won’t need to set the \(x_j\)’s to 0 or 1 yourself, but it’s important to know what’s going on behind the scenes.
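You can peek at these behind-the-scenes 0/1 dummy variables with model.matrix(). A sketch with a made-up two-group factor (the reference level is "Cross", as in our data):

```r
# a made-up factor with two groups; "Cross" is the reference level
type <- factor(c("Cross", "Cross", "Self", "Self"))

# model.matrix() shows the dummy coding lm() will use:
# the intercept column is all 1s, and the typeSelf column is
# 0 for the reference group and 1 for the other group
model.matrix(~ 1 + type)
```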

4.0.4.2 You should now be able to complete problem 1 in the homework assignment.

4.0.4.3 Darwin maize with category

  • Let’s look at our darwin_maize data again

  • We will do the following:

  1. Calculate group means

  2. Display the data separated by group and show the group-level means

ggplot(darwin_maize,
       aes(x = type, 
           y = height, 
           color = type)) +
  geom_point(position = position_jitter(
               width = 0.1, 
               height = 0),
             size = 2) +
  stat_summary(fun = mean, 
               geom = "point",
               color = "black",
               shape = 8, 
               size = 6) +
  theme_bw() +
  scale_color_viridis_d(option = "inferno",
                        begin = 0.3,
                        end = 0.7)

type_means <- darwin_maize |>
  group_by(type) |>
  summarise(m_height = mean(height),
            sd_height = sd(height))
type_means
# A tibble: 2 × 3
  type  m_height sd_height
  <fct>    <dbl>     <dbl>
1 Cross     20.2      3.62
2 Self      17.6      2.05
  • Now let’s fit a model using the categorical predictor
dlm1 <- lm(height ~ 1 + type,
           data = darwin_maize)
display(dlm1)
lm(formula = height ~ 1 + type, data = darwin_maize)
            coef.est coef.se
(Intercept) 20.19     0.76  
typeSelf    -2.62     1.07  
---
n = 30, k = 2
residual sd = 2.94, R-Squared = 0.18
Note

Note that when you have predictor variables in the formula the 1 is no longer needed, so running lm(height ~ type, data = darwin_maize) will give you the same result.

  • The layout of this table is similar as before, but now we have two rows, one called “(Intercept)” and one called “typeSelf”.

  • Recall that there are only 2 categories in the type variable.

  • Because typeSelf is listed in the table, you would be correct in assuming that (Intercept) is the average height of the plants in the Cross group.

\[\LARGE E[y_{cross}] = \beta_0 = 20.19\]

  • You might also assume that the typeSelf coefficient estimate is the average height of the other group of plants.

  • However, this coefficient is showing the estimated difference in means between the groups.

  • So in order to get the mean for the Self group, we add this (negative) estimated difference to the estimate for the Cross group.

\[\LARGE E[y_{self}] = \beta_0 + \beta_1 \times 1 = 20.19 -2.62 = 17.57\]

  • These are the same values (with rounding) that we calculated manually above

  • Calculating estimated means in R

    • Can manually type the values

    • Or use the coefficient estimates directly

# estimated means
20.19
[1] 20.19
20.19 - 2.62
[1] 17.57
# using the coefficients
coef(dlm1)
(Intercept)    typeSelf 
  20.191667   -2.616667 
# type = cross
coef(dlm1)[1]
(Intercept) 
   20.19167 
# type = self
coef(dlm1)[1] + coef(dlm1)[2]
(Intercept) 
     17.575 
Note

The “(Intercept)” here is the name of the vector position.

You can remove it with this:

self_est <- coef(dlm1)[1] + coef(dlm1)[2]
names(self_est) <- NULL
self_est
[1] 17.575

Box 6.1 from The New Statistics, 2nd ed., by Andy Hector.
  • The SE of the mean for the intercept uses the same formula we used in previous classes.

  • The SE difference is calculated with a different formula:

\[\large SED = \sqrt{\frac{s^2_{1}}{n_1} + \frac{s^2_{2}}{n_2}}\]

  • This is a “pooled” estimate of variation

  • I won’t ask you to do this manually, but just be aware that using var() on each group will not get you the correct values.
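A minimal sketch of the SED formula, using simulated groups (note the groups are balanced here, which is the case where this formula and the pooled estimate lm() reports coincide exactly):

```r
# two made-up groups of equal size
set.seed(1)
n <- 15
g1 <- rnorm(n, mean = 20, sd = 3)
g2 <- rnorm(n, mean = 17, sd = 3)

# SED by the formula above
sed <- sqrt(var(g1) / n + var(g2) / n)

# the SE that lm() reports for the group coefficient
df <- data.frame(y = c(g1, g2),
                 group = rep(c("A", "B"), each = n))
fit <- lm(y ~ group, data = df)
se_lm <- summary(fit)$coefficients["groupB", "Std. Error"]

c(sed = sed, se_lm = se_lm)  # identical for balanced groups
```

With unequal group sizes (or clearly unequal variances) the two calculations would diverge, which is why using var() on each group alone is not a safe shortcut in general.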

4.0.4.4 You should now be able to complete problem 2 in the homework assignment.

4.0.5 Confidence intervals for model coefficients (problem 4)

  • Recall that a 95% CI is approximated by:

\[\large 95\% CI = \bar{y} \pm 2*SEM\]

  • Similar to the SED calculation above, there is uncertainty in our estimates of the SEM and SED, so the \(\pm 2 \times SE\) rule only gives an approximate 95% CI.

  • Luckily, there is an R function, confint() which specifically calculates confidence intervals for coefficients from linear models.

  • This function does not work on vectors; it only works on fitted model objects.

confint(dlm1)
               2.5 %     97.5 %
(Intercept) 18.63651 21.7468231
typeSelf    -4.81599 -0.4173433
  • The output of confint() is the lower bound (2.5%) and the upper bound (97.5%) of the confidence interval.

  • So the 95% confidence interval for the average height of the Cross plants is: 18.64, 21.75, and the 95% CI for the average difference between the two groups is: -4.82, -0.42.
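Behind the scenes, confint() multiplies the SE by a critical value from the t-distribution with the model's residual degrees of freedom, rather than a flat 2. A sketch with simulated stand-in data (two groups with made-up means of 20 and 17):

```r
# simulated data standing in for darwin_maize
set.seed(3)
df <- data.frame(
  height = c(rnorm(15, 20, 3), rnorm(15, 17, 3)),
  type   = rep(c("Cross", "Self"), each = 15))
fit <- lm(height ~ type, data = df)

est <- coef(fit)["typeSelf"]
se  <- summary(fit)$coefficients["typeSelf", "Std. Error"]
tcrit <- qt(0.975, df = fit$df.residual)  # a bit above 2, not exactly 2

# manual CI matches confint()
c(est - tcrit * se, est + tcrit * se)
confint(fit)["typeSelf", ]
```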

4.0.6 Interpretation

  • Recall that the hypothesis was that self-pollinating plants were less fit due to lower genetic diversity.

  • What does our data indicate?

  • If there was NO EFFECT of self-pollination, we would expect the heights of each group to be the same.

  • Or in other words, the difference ( \(\beta_1\) ) between the groups would be 0.

    • \(\beta_1 = 0\) means NO difference in group-level means

    • \(\beta_1 \ne 0\) means there IS A DIFFERENCE between the group-level means

  • The 95% CI for \(\beta_1\) is completely negative (does not include 0).

  • We are fairly confident that there is a difference in plant height.

  • Because the mean height of the Self-group of plants is lower, we can conclude the difference in height is consistent with the original hypothesis (self-pollinated plants are shorter and therefore less-fit).

  • The arm package has a nice function for displaying coefficient estimates: coefplot(). This function also requires the object to be a model fit (will not work on a vector).

coefplot(dlm1, xlim = c(-5, 0))

  • I added the xlim = c(-5, 0) argument to ensure that the plot displayed 0, which is our value of interest.

  • When you’re doing this on the homework, you may need to play around with different limits on the x-axis.

  • So, at the 95% confidence level, we can say that there is a decrease in mean plant height in self-pollinated plants.

  • We can test this same hypothesis at a higher level of confidence, i.e., 99% CI

4.0.7 99% CI and coefplot()

  • We can change the level of certainty in both the confint() and coefplot() functions

  • We can calculate a 99% CI directly by adding level = 0.99 to the confint() function.

  • Unfortunately, the coefplot() function only lets us change the number of standard errors displayed. We can set sd = 3 to get approximately the same values.

confint(dlm1, level = 0.99)
                0.5 %     99.5 %
(Intercept) 18.093790 22.2895433
typeSelf    -5.583512  0.3501789
coefplot(dlm1, sd = 3)

  • Notice that at the 99% confidence level (or at 3 * the standard deviation), the interval now crosses 0.

  • In other words, at this more strict level of confidence, we cannot say that there is a difference in height between the two plant groups.

4.0.8 You should now be able to complete problem 3 in the homework assignment.

4.0.9 Linear Model Assumptions

  • It’s always important to remember that all models have assumptions, and you should check that those assumptions are reasonably met.

  • The 4 assumptions of a linear model are:

  1. Linearity
  2. Normality of residuals
  3. Equal variance (homoscedasticity) of residuals
    • For categorical data, need to ensure equal variance across groups
  4. Independence
  • Assumption 1 is usually assessed graphically and will be more important when we get to regression analysis (continuous x-variables).

    • For categorical data it is less often tested directly.
  • Testing assumption 4 directly is tough, so usually we assess this assumption based on our understanding of how the data were generated and observed.

  • For assumptions 2 and 3, even with categorical data, we can assess with graphs.

4.0.9.1 Assumption 2: normal distribution of residuals

  • Use a Q-Q plot

  • A Q-Q plot compares our sample quantiles with the theoretical quantiles of a normal distribution.

  • We can do this using ggplot().

Note
  • The aes() function here takes an argument called sample.

  • We do not use an x or y argument.

ggplot(darwin_maize, aes(sample = height)) +
    stat_qq() + # plot qq points
    stat_qq_line() # reference line

  • If the residuals were perfectly normal, all the points would be exactly on the line.

  • Here, the data in the middle fit pretty well, but the observations at the extreme ends are rather far off the line.

  • Luckily, linear models are fairly robust to this assumption.

  • In this case, it doesn’t look perfect, but it’s not terrible.

  • We will discuss how to deal with non-normal residuals in the future.
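The Q-Q code above uses the raw height values; strictly, the assumption concerns the model's residuals (for an intercept-only model the two have the same Q-Q shape, since the residuals are just the heights shifted by the mean). A minimal sketch of checking a fitted model's residuals directly, using simulated stand-in data and base R's qqnorm()/qqline() for brevity:

```r
# simulated two-group data standing in for darwin_maize
set.seed(7)
df <- data.frame(
  height = c(rnorm(15, 20, 3), rnorm(15, 17, 3)),
  type   = rep(c("Cross", "Self"), each = 15))
fit <- lm(height ~ type, data = df)

# Q-Q plot of the model residuals
res <- residuals(fit)
qqnorm(res)  # sample vs. theoretical quantiles
qqline(res)  # reference line
```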

4.0.9.2 Assumption 3: equal variance of residuals

  • We will compare the fitted model values with the residuals.

    • Residuals are a measure of how far away the observation is from the prediction

      • Draw an estimated value on the board

      • Draw an empirical observation

      • Show the distance

  • For this we can again use ggplot(),

    • however, this time we are plotting the model object, not the raw data directly.

    • Note that I once again added a “jitter” argument to the width (x-axis) to help visualize the individual points but did not modify the position on the height (y-axis)

ggplot(dlm1,
       aes(x = .fitted, y = .resid)) +
    geom_point(
      position = position_jitter(
        width = 0.1,
        height = 0)) +
    geom_hline(yintercept = 0)
Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggplot2 package.
  Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.

  • What we are looking for here is that the minimum and maximum y-values are the same across the x-values, and that they are approximately equally distributed above and below the reference line (y = 0).

    • Basically, no pattern moving left to right

    • Same heights (max and min values)

  • Here, we have a few points that are well below the rest (difference in range)

    • negative values below -3
  • The spread also looks somewhat asymmetric (most points sit above the 0 reference line).

  • Similar to the results above, this is not perfect, but also not terrible.

4.0.9.3 Categorical data: Check variance within groups

  • When we have categorical predictors, we also want to ensure that the variance between groups is approximately the same.

  • We can do this with a boxplot of the groups.

ggplot(darwin_maize,
       aes(x = height,
           y = type)) +
  geom_boxplot()+
  labs(title = "Example of approx. equal variances across groups")

  • Here, the widths of the boxplots are approximately equal

  • This means that each group has approximately the same observed variation.

  • The dots on the left side for each group are not ideal, but this doesn’t look too concerning.

  • What would be concerning is if the widths of the boxes were different sizes. For example:

set.seed(2112)
df <- data.frame(x = c(rnorm(10, 10, 1), 
                       rnorm(10, 20, 10)), 
                 group = rep(c("A", "B"), each = 10))
ggplot(df, 
       aes(y = group, 
           x = x)) +
  geom_boxplot() +
  labs(title = "Example of unequal variances across groups")

4.0.10 You should now be able to complete problem 4 in the homework assignment.

4.0.11 Re-level your data to change the “reference” level

  • We can modify the data to estimate the mean and SE for the other plant group (Self).

  • Right now, the type column is a factor.

class(darwin_maize$type)
[1] "factor"
levels(darwin_maize$type)
[1] "Cross" "Self" 
  • The default behavior for R is to order factor levels alphabetically.

  • We can change this by setting the levels of type manually

  • I will first make a new object called darwin_factor, which will be an exact copy of darwin_maize.

  • I will then change the levels of the type variable directly.

# copy the data object
darwin_factor <- darwin_maize

# change the class of the type column
# and set the levels
darwin_factor$type <- factor(
  darwin_factor$type,
  levels = c("Self", "Cross"))

# what is the class of the new type column?
class(darwin_factor$type)
[1] "factor"
# what are the order of the levels?
levels(darwin_factor$type)
[1] "Self"  "Cross"
  • Now, we can perform the same linear model analysis as above with the Self group as our reference level.
display(lm(height~type, data = darwin_factor))
lm(formula = height ~ type, data = darwin_factor)
            coef.est coef.se
(Intercept) 17.58     0.76  
typeCross    2.62     1.07  
---
n = 30, k = 2
residual sd = 2.94, R-Squared = 0.18
  • Now we can see that the estimate for the Intercept coefficient is different (the SE happens to be the same here because the group sizes are equal).

  • Likewise the second row of the table now says typeCross instead of typeSelf.
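A related base-R shortcut (not used above) is relevel(), which moves a chosen level of a factor to the front so it becomes the reference. A quick sketch with a made-up factor:

```r
# a made-up factor; the default reference level is alphabetical ("Cross")
type <- factor(c("Cross", "Self", "Self", "Cross"))
levels(type)
# [1] "Cross" "Self"

# relevel() moves the chosen level to the front (the reference)
type2 <- relevel(type, ref = "Self")
levels(type2)
# [1] "Self"  "Cross"
```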

4.0.12 You should now be able to complete the homework assignment.

5 Summary

  • Models are ways of estimating group-level statistics from empirical data

  • Important to think about the mathematical formula

    • What group-level mean is represented by \(\beta_0\)?

    • How do you calculate the mean of the other group(s)?

      • \(\beta_0 + \beta_j\)
  • You can re-level your data to set your “reference” level

  • All models have assumptions

  • Be sure to check your data to see if assumptions are reasonable